
Week 1 lecture notes
Reviewing introductory papers on interpretability and linguistic probes inside the black box of neural language models
The Barest Thought of an Intro to Neural Nets
- A brief recent history of neural networks
- Neural networks are mathematical objects — for doing computation
- The most common types can be boiled down to simple matrix multiplication - the forward pass
- Models can be hard-wired or learned, typically using gradient-based methods like backpropagation
- General framing: The model will try to learn some mapping from the input (e.g., some vector representing the pixels of an image) to an output (i.e., a prediction, such as a single number, or a vector of numbers, such as class probabilities)
- Simplest multi-layer models (e.g., the multilayer perceptron) perform two stages of matrix multiplication, with a nonlinear transformation applied to the intermediate state
- These nonlinear transformations are called “activation functions”, a term that goes back to earlier connectionist modeling papers
- Nonlinearities allow models to learn statistically interesting conjunctions of features; the classic example is the XOR problem (see the sketch after this list)
- These conjunctions of features are interactions in the same sense as in statistics: the effect of one variable depends on the level of another (e.g., the ab term in a + b + ab)
- Linguistic structure is highly interactive — there are usually multiple sources of information that influence how we interpret language
- Five years after Mikolov et al. (2013), the foundational word2vec paper, what was the state of research in NLP? A variety of models: ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and many more since
- Movement from recurrent structures (e.g., RNNs and LSTMs) to attention-based computations using Transformer architectures (e.g., Vaswani et al., 2017)
- Terminological note: I use “RNN” for models whose only mechanism for carrying prior hidden states forward is simple recurrence; LSTMs (which add gating such as forget gates) are a very different architecture; Transformers are trained in much the same way as RNNs and LSTMs but have no recurrence, so predictions for all positions are computed in parallel
- Neural network models have massively grown in size and numbers of parameters
- Big questions about neural networks:
- What is in the input and output of these models? = What encoding representations are we using? What assumptions does using those representations make?
- What can and do the models learn from the data?
- How are they generally trained?
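A minimal sketch of the two-stage forward pass and the XOR point above. The 2-2-1 multilayer perceptron below uses hand-picked weights (an assumption for brevity; in practice the weights would be learned by backpropagation), and the nonlinearity is what makes XOR expressible at all, since no single linear layer can separate these four points.

```python
import numpy as np

# XOR inputs and targets: not linearly separable, so no single
# matrix multiplication X @ w + b can reproduce y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-picked weights for a 2-2-1 MLP with ReLU (one of many valid solutions).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # input -> hidden
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])    # hidden -> output
b2 = 0.0

def forward(X):
    h = np.maximum(0, X @ W1 + b1)  # first multiplication + nonlinear activation
    return h @ W2 + b2              # second stage of multiplication

print(forward(X))  # [0. 1. 1. 0.] matches y, i.e., XOR of the two inputs
```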
Readings
Alishahi, A., Chrupała, G., & Linzen, T. (2019). Analyzing and interpreting neural networks for NLP: A report on the first BlackboxNLP workshop. Natural Language Engineering, 25(4), 543-557. https://www.cambridge.org/core/journals/natural-language-engineering/article/analyzing-and-interpreting-neural-networks-for-nlp-a-report-on-the-first-blackboxnlp-workshop/FAFF1B645BBF89FE400A521526AA65D4
Notes
- “Octopus paper” - Bender and Koller (2020) (“Climbing towards NLU”)
- Wide variety of reasons to want interpretable models
- Stakeholders in a business
- Accountability for legal reasons (e.g., California or the EU)
- “Black box” —> BlackboxNLP
- Approaches outlined in BlackboxNLP
- Developing annotated and specialized datasets to test models
- Manipulation of the input to neural networks to test for importance of specific linguistic or demographic features
- Developing diagnostic classifiers trained over intermediate representations from within a neural network model
- Modifying neural network architectures to make them more explainable —> Simplify or distill the model into a smaller, simpler one
- Designing training or testing datasets over simplified or formal languages
- Input manipulation
- Punctuation
- Tokenization
- Lemmatization
- Chunking
- Datasets
- Diverse NLI - model must answer logical/semantic questions of varying linguistic complexity
- GLUE - Benchmark suite of natural language understanding tasks spanning different domains
- Human reference points, e.g., children’s behavior in theory-of-mind experiments
- Sentences of varying types of linguistic complexity (e.g., subject-verb agreement tests)
- Developing diagnostic classifiers
- Auxiliary task - Some other task (e.g., sentiment analysis)
- Diagnostic classifiers - Is the presence or absence of a linguistic feature “in” the encoding/embedding/vector representation? (a minimal probe is sketched after these notes)
- Can leverage the predictions of diagnostic classifiers to “nudge” a trained model in a more linguistic direction
- Part-of-speech classifiers (e.g., NOUN, ADJ, VERB, PUNCT)
- Subject-verb agreement (“The key(s) [to the cabinet(s)] is/are on the table”)
- Nearest neighbors with a notion of conformity (Wallace, Feng, & Boyd-Graber, 2018): removing a feature (e.g., a word from a passage) can shift the overall representation and the model’s prediction (see the ablation sketch after these notes)
- Probing
- Decoding
- Modifying neural network architectures
- Simplified or formal languages
- Cross-linguistic transfer between a large corpus and a small corpus to see how original learned representations do/do not get preserved when training on a “new” language
- Formal languages
- Recognizing whether a string is valid in some formal system or not
- Some formal languages require a pushdown automaton, something with a stack that can keep track of previously seen symbols; how well networks cope depends on a complex interaction between activation function (e.g., ReLU) and architecture (GRU, LSTM, or plain recurrent network)
- RNNs and LSTMs perform poorly at recognizing Dyck languages (matching opening and closing brackets) on strings longer than those they were trained on (a recognizer and length-generalization split are sketched after these notes)
- Desirable future links
- Evaluation - “When an explanation matches what a human would see as a reasonable basis of a particular decision, it does not necessarily follow that this was the basis”
- Benchmarks
- Neuroscientific alignment
- Growing area in natural language processing!
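Two of the approaches above are easy to sketch in code. First, input manipulation by leave-one-out ablation (the idea behind removing a word and watching the representation or prediction shift, as in Wallace et al.): the snippet below uses a toy bag-of-words classifier as a stand-in for a real model, and the training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a trained NLP system; a real study would probe an
# LSTM or Transformer classifier instead of a bag-of-words model.
train_texts = ["great movie , loved it", "terrible plot , hated it",
               "wonderful acting throughout", "boring and awful writing"]
train_labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

def p_positive(text):
    """Probability the (toy) model assigns to the positive class."""
    return clf.predict_proba(vec.transform([text]))[0, 1]

sentence = "great acting but boring plot"
print(f"full input: {p_positive(sentence):.3f}")

# Leave-one-out ablation: drop each token in turn and see how far the
# prediction moves; large shifts flag tokens the model relied on.
tokens = sentence.split()
for i, tok in enumerate(tokens):
    ablated = " ".join(tokens[:i] + tokens[i + 1:])
    print(f"without {tok!r}: {p_positive(ablated):.3f}")
```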
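Second, a diagnostic (probing) classifier: freeze a pretrained encoder, pull out hidden states, and fit a deliberately simple linear classifier on top, so that any success reflects information already present in the representation. The word list, layer choice, and noun/verb task below are toy assumptions; a real probe would use a tagged corpus (and control tasks). The snippet downloads bert-base-uncased on first run.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A handful of hand-labelled tokens (word, coarse POS).
data = [("dogs", "NOUN"), ("cats", "NOUN"), ("tables", "NOUN"), ("keys", "NOUN"),
        ("run", "VERB"), ("sleep", "VERB"), ("eat", "VERB"), ("write", "VERB")]

def embed(word, layer=6):
    """Hidden state for a single word at the given encoder layer."""
    with torch.no_grad():
        out = model(**tok(word, return_tensors="pt"))
    return out.hidden_states[layer][0, 1].numpy()  # first wordpiece after [CLS]

X = np.stack([embed(w) for w, _ in data])
y = [pos for _, pos in data]

# The probe itself stays simple (linear), so it can only read out what the
# frozen encoder already represents.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict([embed("chairs"), embed("jump")]))  # hopefully NOUN, VERB
```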
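Finally, the formal-language point: recognizing a Dyck language needs a stack, i.e., a pushdown automaton, and the split at the end mirrors the train-short/test-long setup described above. Uniform random sampling of brackets is only for brevity; real experiments sample from the grammar so that valid strings are well represented.

```python
import random

PAIRS = {")": "(", "]": "["}

def is_valid_dyck(s):
    """Stack-based recognizer: remembers every unmatched opening bracket."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # valid only if everything opened was closed

print(is_valid_dyck("([()])"))  # True
print(is_valid_dyck("([)]"))    # False: the brackets cross

def sample(n, lo, hi):
    """n random bracket strings with lengths in [lo, hi]."""
    return ["".join(random.choice("()[]") for _ in range(random.randint(lo, hi)))
            for _ in range(n)]

# Train on short strings, test on strictly longer ones.
train = [(s, is_valid_dyck(s)) for s in sample(5, 2, 10)]
test = [(s, is_valid_dyck(s)) for s in sample(5, 20, 30)]
print(train)
print(test)
```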
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842-866. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00349/96482/A-Primer-in-BERTology-What-We-Know-About-How-BERT
Notes
- Syntactic knowledge
- Define each of the following:
- Linear versus hierarchical structure (e.g., “The cat the dog is sleeping next to is cute”)
- Part-of-speech information (e.g., NOUN, ADJECTIVE, etc.)
- Syntactic chunks (what sequences go together)
- Roles (e.g., subject, object, arguments, adjuncts)
- Named entity categories (memorization)
- Pragmatic inference
- Event knowledge
- Syntactic relations (e.g., syntactic dependencies)
- Subject-verb agreement (a masked-LM check is sketched after this list)
- Anaphora
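A quick sketch of how the subject-verb agreement point can be checked against BERT directly: mask the verb in “key(s) to the cabinet(s)”-style items and compare the scores the masked language model assigns to “is” versus “are”. This uses Hugging Face’s fill-mask pipeline and downloads bert-base-uncased on first run; the two sentences are illustrative, not a full agreement benchmark.

```python
from transformers import pipeline

# Does BERT prefer the verb form that agrees with the head noun, even with an
# intervening "attractor" noun of the opposite number?
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sentence in ["The key to the cabinets [MASK] on the table.",
                 "The keys to the cabinet [MASK] on the table."]:
    candidates = unmasker(sentence, targets=["is", "are"])
    scores = {c["token_str"]: round(c["score"], 4) for c in candidates}
    print(sentence, scores)
```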
Madsen, A., Reddy, S., & Chandar, S. (2022). Post-hoc interpretability for neural NLP: A survey. ACM Computing Surveys (Just Accepted, June 2022). https://doi.org/10.1145/3546577
Notes
- Motivations for interpretability
- “incompleteness in the problem formalization”
- Accountability
- Safety
- Ethics
- Scientific understanding
- Communication strategies in the interpretability literature
- Local explanations (single observations)
- Global explanations (the whole model)
- Class explanations (multiple observations from a single class)
- Intrinsic interpretability
- Post-hoc interpretability - methods or models built after an NLP system is trained, used to interpret its behavior
- Measures of interpretability
- Application-grounded - evaluated in the real end task, e.g., do doctors working with the AI and its explanations save more lives than doctors (or AIs!) alone?
- Functionally-grounded - comparing with other post-hoc methods or an intrinsically interpretable model (e.g., a linear model)
- Human-grounded - An estimate of the utility to people in general (vs. researcher intuitions), e.g., the model people choose as the most accurate